Income and Credit Analysis by Riley Emmons

General Dataset Info

I have chosen to explore the Prosper Loan Dataset. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

Here’s a link to a page with definitions for the variables used in this dataset: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0

Univariate Plots Section

First, I’m going to run a few summaries to try and get a feel for my data. I’ll see what it looks like in general, and look for anything that stands out to me. This dataset has a lot of observations so there should be plenty of options.

##                    ListingKey     ListingNumber    
##  17A93590655669644DB4C06:     6   Min.   :      4  
##  349D3587495831350F0F648:     4   1st Qu.: 400919  
##  47C1359638497431975670B:     4   Median : 600554  
##  8474358854651984137201C:     4   Mean   : 627886  
##  DE8535960513435199406CE:     4   3rd Qu.: 892634  
##  04C13599434217079754AEE:     3   Max.   :1255725  
##  (Other)                :113912                    
##                     ListingCreationDate  CreditGrade         Term      
##  2013-10-02 17:20:16.550000000:     6          :84984   Min.   :12.00  
##  2013-08-28 20:31:41.107000000:     4   C      : 5649   1st Qu.:36.00  
##  2013-09-08 09:27:44.853000000:     4   D      : 5153   Median :36.00  
##  2013-12-06 05:43:13.830000000:     4   B      : 4389   Mean   :40.83  
##  2013-12-06 11:44:58.283000000:     4   AA     : 3509   3rd Qu.:36.00  
##  2013-08-21 07:25:22.360000000:     3   HR     : 3508   Max.   :60.00  
##  (Other)                      :113912   (Other): 6745                  
##                  LoanStatus                  ClosedDate   
##  Current              :56576                      :58848  
##  Completed            :38074   2014-03-04 00:00:00:  105  
##  Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##  (Other)              : 1108   (Other)            :54633  
##   BorrowerAPR       BorrowerRate     LenderYield     
##  Min.   :0.00653   Min.   :0.0000   Min.   :-0.0100  
##  1st Qu.:0.15629   1st Qu.:0.1340   1st Qu.: 0.1242  
##  Median :0.20976   Median :0.1840   Median : 0.1730  
##  Mean   :0.21883   Mean   :0.1928   Mean   : 0.1827  
##  3rd Qu.:0.28381   3rd Qu.:0.2500   3rd Qu.: 0.2400  
##  Max.   :0.51229   Max.   :0.4975   Max.   : 0.4925  
##  NA's   :25                                          
##  EstimatedEffectiveYield EstimatedLoss   EstimatedReturn 
##  Min.   :-0.183          Min.   :0.005   Min.   :-0.183  
##  1st Qu.: 0.116          1st Qu.:0.042   1st Qu.: 0.074  
##  Median : 0.162          Median :0.072   Median : 0.092  
##  Mean   : 0.169          Mean   :0.080   Mean   : 0.096  
##  3rd Qu.: 0.224          3rd Qu.:0.112   3rd Qu.: 0.117  
##  Max.   : 0.320          Max.   :0.366   Max.   : 0.284  
##  NA's   :29084           NA's   :29084   NA's   :29084   
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000                  :29084         Min.   : 1.00  
##  1st Qu.:3.000           C      :18345         1st Qu.: 4.00  
##  Median :4.000           B      :15581         Median : 6.00  
##  Mean   :4.072           A      :14551         Mean   : 5.95  
##  3rd Qu.:5.000           D      :14274         3rd Qu.: 8.00  
##  Max.   :7.000           E      : 9795         Max.   :11.00  
##  NA's   :29084           (Other):12307         NA's   :29084  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
##  Min.   :  0.00           False:56459         False:101218    
##  1st Qu.: 26.00           True :57478         True : 12719    
##  Median : 67.00                                               
##  Mean   : 96.07                                               
##  3rd Qu.:137.00                                               
##  Max.   :755.00                                               
##  NA's   :7625                                                 
##                     GroupKey                 DateCreditPulled 
##                         :100596   2013-12-23 09:38:12:     6  
##  783C3371218786870A73D20:  1140   2013-11-21 09:09:41:     4  
##  3D4D3366260257624AB272D:   916   2013-12-06 05:43:16:     4  
##  6A3B336601725506917317E:   698   2014-01-14 20:17:49:     4  
##  FEF83377364176536637E50:   611   2014-02-09 12:14:41:     4  
##  C9643379247860156A00EC0:   342   2013-09-27 22:04:54:     3  
##  (Other)                :  9634   (Other)            :113912  
##  CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.0         Min.   : 19.0        
##  1st Qu.:660.0         1st Qu.:679.0        
##  Median :680.0         Median :699.0        
##  Mean   :685.6         Mean   :704.6        
##  3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :880.0         Max.   :899.0        
##  NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##                     :   697     Min.   : 0.00      Min.   : 0.00  
##  1993-12-01 00:00:00:   185     1st Qu.: 7.00      1st Qu.: 6.00  
##  1994-11-01 00:00:00:   178     Median :10.00      Median : 9.00  
##  1995-11-01 00:00:00:   168     Mean   :10.32      Mean   : 9.26  
##  1990-04-01 00:00:00:   161     3rd Qu.:13.00      3rd Qu.:12.00  
##  1995-03-01 00:00:00:   159     Max.   :59.00      Max.   :54.00  
##  (Other)            :112389     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 25.00             Median : 6.00        
##  Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :136.00             Max.   :51.00        
##  NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.000                      Min.   : 0.000         
##  1st Qu.:0.820                      1st Qu.: 0.000         
##  Median :0.940                      Median : 0.000         
##  Mean   :0.886                      Mean   : 0.802         
##  3rd Qu.:1.000                      3rd Qu.: 1.000         
##  Max.   :1.000                      Max.   :20.000         
##  NA's   :7544                       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome                    LoanKey       TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:     6   Min.   :0.00     
##  1st Qu.:   3200     2DEE3698211017519D7333F:     4   1st Qu.:1.00     
##  Median :   4667     9F4B37043517554537C364C:     4   Median :1.00     
##  Mean   :   5608     D895370150591392337ED6D:     4   Mean   :1.42     
##  3rd Qu.:   6825     E6FB37073953690388BC56D:     4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:     3   Max.   :8.00     
##                      (Other)                :113912   NA's   :91852    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:  9.00             1st Qu.:  9.00       
##  Median : 16.00             Median : 15.00       
##  Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :141.00             Max.   :141.00       
##  NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2014-01-22 00:00:00:   491   Q4 2013:14450         
##  1st Qu.: 4000      2013-11-13 00:00:00:   490   Q1 2014:12172         
##  Median : 6500      2014-02-19 00:00:00:   439   Q3 2013: 9180         
##  Mean   : 8337      2013-10-16 00:00:00:   434   Q2 2013: 7099         
##  3rd Qu.:12000      2014-01-28 00:00:00:   339   Q3 2012: 5632         
##  Max.   :35000      2013-09-24 00:00:00:   316   Q2 2012: 5061         
##                     (Other)            :111428   (Other):60343         
##                    MemberKey      MonthlyLoanPayment LP_CustomerPayments
##  63CA34120866140639431C9:     9   Min.   :   0.0     Min.   :   -2.35   
##  16083364744933457E57FB9:     8   1st Qu.: 131.6     1st Qu.: 1005.76   
##  3A2F3380477699707C81385:     8   Median : 217.7     Median : 2583.83   
##  4D9C3403302047712AD0CDD:     8   Mean   : 272.5     Mean   : 4183.08   
##  739C338135235294782AE75:     8   3rd Qu.: 371.6     3rd Qu.: 5548.40   
##  7E1733653050264822FAA3D:     8   Max.   :2251.5     Max.   :40702.39   
##  (Other)                :113888                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 

I knew this dataset was large, but looking through that summary gave me a better idea of just how large; it’s a bit overwhelming. For now I’m going to pick out a more manageable number of interesting variables, run some analysis on them, and add more back later if I’m so inclined.

##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"

Out of those 81, below is the list of the ones I decided to keep. Next I’ll subset my df and run some more summaries.

##    LoanNumber      CreditGrade         Term      
##  Min.   :     1          :84984   Min.   :12.00  
##  1st Qu.: 37332   C      : 5649   1st Qu.:36.00  
##  Median : 68599   D      : 5153   Median :36.00  
##  Mean   : 69444   B      : 4389   Mean   :40.83  
##  3rd Qu.:101901   AA     : 3509   3rd Qu.:36.00  
##  Max.   :136486   HR     : 3508   Max.   :60.00  
##                   (Other): 6745                  
##                  LoanStatus     BorrowerAPR       BorrowerRate   
##  Current              :56576   Min.   :0.00653   Min.   :0.0000  
##  Completed            :38074   1st Qu.:0.15629   1st Qu.:0.1340  
##  Chargedoff           :11992   Median :0.20976   Median :0.1840  
##  Defaulted            : 5018   Mean   :0.21883   Mean   :0.1928  
##  Past Due (1-15 days) :  806   3rd Qu.:0.28381   3rd Qu.:0.2500  
##  Past Due (31-60 days):  363   Max.   :0.51229   Max.   :0.4975  
##  (Other)              : 1108   NA's   :25                        
##   LenderYield       ProsperScore   BorrowerState  
##  Min.   :-0.0100   Min.   : 1.00   CA     :14717  
##  1st Qu.: 0.1242   1st Qu.: 4.00   TX     : 6842  
##  Median : 0.1730   Median : 6.00   NY     : 6729  
##  Mean   : 0.1827   Mean   : 5.95   FL     : 6720  
##  3rd Qu.: 0.2400   3rd Qu.: 8.00   IL     : 5921  
##  Max.   : 0.4925   Max.   :11.00          : 5515  
##                    NA's   :29084   (Other):67493  
##                     Occupation    IsBorrowerHomeowner
##  Other                   :28617   False:56459        
##  Professional            :13628   True :57478        
##  Computer Programmer     : 4478                      
##  Executive               : 4311                      
##  Teacher                 : 3759                      
##  Administrative Assistant: 3688                      
##  (Other)                 :55456                      
##  CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.0         Min.   : 19.0        
##  1st Qu.:660.0         1st Qu.:679.0        
##  Median :680.0         Median :699.0        
##  Mean   :685.6         Mean   :704.6        
##  3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :880.0         Max.   :899.0        
##  NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##                     :   697     Min.   : 0.00      Min.   : 0.00  
##  1993-12-01 00:00:00:   185     1st Qu.: 7.00      1st Qu.: 6.00  
##  1994-11-01 00:00:00:   178     Median :10.00      Median : 9.00  
##  1995-11-01 00:00:00:   168     Mean   :10.32      Mean   : 9.26  
##  1990-04-01 00:00:00:   161     3rd Qu.:13.00      3rd Qu.:12.00  
##  1995-03-01 00:00:00:   159     Max.   :59.00      Max.   :54.00  
##  (Other)            :112389     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts TotalInquiries   
##  Min.   :  2.00             Min.   : 0.00         Min.   :  0.000  
##  1st Qu.: 17.00             1st Qu.: 4.00         1st Qu.:  2.000  
##  Median : 25.00             Median : 6.00         Median :  4.000  
##  Mean   : 26.75             Mean   : 6.97         Mean   :  5.584  
##  3rd Qu.: 35.00             3rd Qu.: 9.00         3rd Qu.:  7.000  
##  Max.   :136.00             Max.   :51.00         Max.   :379.000  
##  NA's   :697                                      NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   IncomeVerifiable
##  Min.   : 0.0000      Min.   :     0.0   False:  8669    
##  1st Qu.: 0.0000      1st Qu.:     0.0   True :105268    
##  Median : 0.0000      Median :     0.0                   
##  Mean   : 0.5921      Mean   :   984.5                   
##  3rd Qu.: 0.0000      3rd Qu.:     0.0                   
##  Max.   :83.0000      Max.   :463881.0                   
##  NA's   :697          NA's   :7622                       
##  StatedMonthlyIncome MonthlyLoanPayment
##  Min.   :      0     Min.   :   0.0    
##  1st Qu.:   3200     1st Qu.: 131.6    
##  Median :   4667     Median : 217.7    
##  Mean   :   5608     Mean   : 272.5    
##  3rd Qu.:   6825     3rd Qu.: 371.6    
##  Max.   :1750003     Max.   :2251.5    
## 
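The subsetting step that produced the summary above can be sketched as follows. This is a minimal illustration, not the original code: the data frame `loans`, its values, and the short `keep` list are all hypothetical stand-ins for the real 81-column data.

```r
# Hypothetical stand-in for the full loan data frame (the real one has 81 columns).
loans <- data.frame(
  LoanNumber     = 1:5,
  Term           = c(36, 36, 60, 12, 36),
  BorrowerRate   = c(0.13, 0.18, 0.25, 0.09, 0.31),
  TotalInquiries = c(2, 4, 7, NA, 379),
  Investors      = c(1, 44, 115, 2, 80),
  ListingNumber  = c(4, 400919, 600554, 892634, 1255725)
)

# Columns to keep (a short illustrative list, not the full selection used in this report).
keep <- c("LoanNumber", "Term", "BorrowerRate", "TotalInquiries")

# Subset by column name; drop = FALSE keeps a data frame even for a single column.
loans_small <- loans[, keep, drop = FALSE]

summary(loans_small)
```

Selecting by name rather than by position keeps the subset readable and stable if the column order ever changes.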

The first thing that stands out to me in this summary is the dramatic difference between the median and max values for most of the credit-related variables. Open Revolving Accounts has a median of 6 and a max of 51. Total Inquiries has a median of 4 and an astounding max of 379! I’ll definitely look a little more closely. Also, our Credit Grade column seems to be mostly blank values, so that’ll be a problem later.

As I suspected from the summary, there must be very few outliers far above the third quartile. Perhaps only one, because they’re not even showing up on this chart. It’s possible that there’s some erroneous data, or perhaps there’s one guy who really, really likes making credit inquiries. Either way, this doesn’t help me learn anything about most of my data, so next I’ll get a closer look at my data without those outliers. First I’ll just guess at some limits based on my observations from this plot.
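Trimming a histogram to a guessed limit can be sketched like this. The vector `total_inquiries` and the cutoff of 25 are assumptions made up for illustration; they are not taken from the real data.

```r
# Hypothetical inquiry counts with one extreme value, mimicking the shape described above.
total_inquiries <- c(0, 1, 2, 2, 3, 4, 4, 5, 7, 9, 12, NA, 379)

# Assumed cutoff, chosen by eye from a first plot.
upper_limit <- 25

# Drop missing values, then keep only values at or below the cutoff.
trimmed <- total_inquiries[!is.na(total_inquiries) & total_inquiries <= upper_limit]

# plot = FALSE returns the bin counts instead of drawing; remove it to draw the histogram.
bins <- hist(trimmed, breaks = seq(0, upper_limit, by = 5), plot = FALSE)
bins$counts
```

Filtering the data before binning, instead of only zooming the axis, also means the bin widths are computed from the trimmed range rather than from the outlier.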

So this is much more interesting already; we can see what our distribution actually looks like. Next I’ll run some more summary statistics to get a better feel for it, and probably build one more plot based off that information.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159

In the above summary for total inquiries we can see that our 3rd quartile is 7 despite having a max of 379. Next I’ll look at the 95th percentile.

## 95% 
##  16

Looks like 95% of our observations have 16 or fewer inquiries. I’ll build one more plot on this data using that limit.
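Here is a minimal sketch of picking a plot limit from the 95th percentile. The vector below is hypothetical (100 made-up counts plus one missing value), so its limit will not match the 16 reported above.

```r
# Hypothetical inquiry counts: 95 ordinary values, 5 large ones, and one NA.
total_inquiries <- c(rep(0:18, each = 5), 20, 25, 40, 105, 379, NA)

# 95th percentile; na.rm = TRUE is needed because quantile() stops on NA by default.
limit_95 <- quantile(total_inquiries, probs = 0.95, na.rm = TRUE)

# Share of non-missing observations at or below that limit.
share_within <- mean(total_inquiries <= limit_95, na.rm = TRUE)

limit_95
share_within
```

Because a quantile is a threshold, the accurate reading is "at or below the limit" rather than "strictly below".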

Next I just want to see if the other credit metrics follow the same pattern. It would make sense for the number of credit accounts to be distributed the same way that inquiries are.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00    9.00    9.26   12.00   54.00    7604
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    6.00    6.97    9.00   51.00

The graphs for Open Credit Lines and Open Revolving Accounts are very similar to each other, but they do not match the total inquiries as well as I thought they might. They are not nearly as heavily right-skewed; that is, their values are not bunched up as tightly at the low end with a long tail to the right.

This is interesting. It looks like most people have very few delinquencies, again with a small number of outliers. Let’s refine this plot a bit.

Looks like most people in this data set don’t have any current delinquencies, so WTG most people! Next we’ll look at credit ratings. First, just a quick bar chart to see what we’ve got. The previous variables have all been numeric, so histograms were used; this variable is categorical, so I’ll be using a bar chart.

Now the blank data for Credit Grades becomes a problem. I looked at the definitions, and it turns out we only have this data for pre-2009 records. Because I don’t have any way of acquiring the missing values, for now I’ll just redo this plot without them to get a closer look at the values we do have.

Removing the blanks cuts down our observations significantly, which isn’t great. That being said, we can still make some better observations this way. Looks like the most common rating in our data set is C, and very few borrowers have no credit. A reasonable hypothesis to explain this is that one would be less likely to apply for a loan if one has no credit.
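Dropping the blank grades before counting can be sketched as follows. The factor `credit_grade` is a small hypothetical sample in which `""` stands for the blank values.

```r
# Hypothetical credit grades; "" stands for the blank values in the real column.
credit_grade <- factor(c("", "C", "", "D", "C", "AA", "", "B", "C", "HR", "NC", ""))

# Keep only non-blank grades, then drop the now-unused "" level.
graded <- droplevels(credit_grade[credit_grade != ""])

# Counts per grade, sorted from most to least common.
grade_counts <- sort(table(graded), decreasing = TRUE)
grade_counts
```

Calling `droplevels()` matters here: without it the empty level would still show up as a zero-height bar when the counts are passed to a bar chart such as `barplot(grade_counts)`.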

Next I want to look at Stated Monthly Income by Occupation. I’m going to do a quick summary of it.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

Just like with some of our credit metrics, we have a max value way outside of our 3rd quartile. I’ll use the 95th percentile as the limit for our graph.

There are interesting spikes in our dataset; it makes sense that people would generally report round numbers for their income. Shout out to everyone who seems to have gotten a loan and reported 0 income.

Lastly for my univariate exploration I’ll look at occupation data. I’d like to compare occupations in my upcoming bivariate exploration, so it’ll be helpful to have an understanding of the distribution. First I’ll see how many unique occupations we’re working with.

## [1] "There are 68 Occupations"

That’s a lot, so I probably won’t analyze all of them, but I want to get a feel for what the distribution looks like, to see if there’s anything of which I should be aware.
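Counting the distinct occupations can be sketched like this, using a hypothetical `occupation` factor. One caveat, which is my reading rather than something confirmed: the level listing printed later in this report appears to begin with a blank entry, so a count taken this way may include a blank level.

```r
# Hypothetical occupation values; "" stands for a blank entry.
occupation <- factor(c("Other", "Professional", "Teacher", "Executive",
                       "Other", "Teacher", "", "Laborer"))

# Number of distinct values, blank included.
n_occupations <- length(unique(occupation))
paste("There are", n_occupations, "Occupations")

# Same count with the blank entry excluded.
n_named <- length(setdiff(unique(as.character(occupation)), ""))

# Frequency of each occupation, most common first.
occupation_counts <- sort(table(occupation), decreasing = TRUE)
occupation_counts
```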

So our first and second largest bars are “Other” and “Professional” respectively. I’ll leave those out of my analysis because they aren’t descriptive enough.

Univariate Analysis

What is the structure of your dataset?

This is a huge dataset with as many variables and observations as anyone could hope for. There’s a lot to explore here, and plenty more interesting things to notice than I had time to look into.

Key Observations: Most of the credit-related metrics have huge outliers. The median number of open revolving accounts is 6. The median number of open credit lines is 9. Most observations have no delinquencies. Most of the data set is missing its credit ratings. Of the credit ratings we have, C is the most common.

What is/are the main feature(s) of interest in your dataset?

The main features in this dataset in which I am interested are the various metrics of credit, namely credit rating. I am interested in seeing which other variables might serve as indicators of credit, such as occupation.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Occupation is the main variable I’ll look into, but others such as loan term and APR should have some correlations.

Did you create any new variables from existing variables in the dataset?

No, this dataset has plenty of variables, so I’ve yet to find it necessary.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some of the distributions were made unusual by some extreme outliers. The only changes I made were to remove outliers and missing values from certain variables.

Bivariate Plots Section

For this section I’m going to be making a lot of comparisons around Occupations. I’m going to randomly sample ten occupations, and make sure I don’t get “Other” or “Professional”.

##  [1] Executive                          Laborer                           
##  [3] Sales - Commission                 Nurse (LPN)                       
##  [5] Police Officer/Correction Officer  Medical Technician                
##  [7] Chemist                            Student - College Graduate Student
##  [9] Student - College Sophomore        Student - College Freshman        
## 68 Levels:  Accountant/CPA Administrative Assistant Analyst ... Waiter/Waitress

Alright, so those are my sample occupations; we’ll just look at these. I made another dataframe with just the observations within the selected occupations.
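The sampling and filtering described above can be sketched as follows. Everything here is assumed for illustration: the tiny `loans` data frame, the seed, and a sample size of 3 instead of 10.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Hypothetical stand-in for the loan data.
loans <- data.frame(
  Occupation = c("Other", "Professional", "Teacher", "Executive", "Laborer",
                 "Chemist", "Teacher", "Other", "Executive", "Nurse (LPN)"),
  StatedMonthlyIncome = c(4000, 5200, 3900, 9100, 2800,
                          6100, 4100, 3500, 12000, 3700),
  stringsAsFactors = TRUE
)

# Remove the non-descriptive levels before sampling, so they can never be drawn.
excluded <- c("Other", "Professional", "")
candidates <- setdiff(levels(loans$Occupation), excluded)

# Sample occupations without replacement (3 here; the report uses ten).
sampled <- sample(candidates, size = 3)

# Keep only rows in the sampled occupations and drop the unused factor levels.
loans_sampled <- droplevels(loans[loans$Occupation %in% sampled, ])
loans_sampled
```

Excluding the unwanted levels up front avoids having to re-draw the sample until it happens to miss them.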

So if I’d thought about it a bit more I’d have realized that this would be what that plot looked like - I still think it’s interesting though. Every occupation has every credit rating, except for the ones that don’t have No Credit (the double negative is correct in this context) and the students, who don’t have the highest credit ratings.

So this one is much more interesting. At first there were a couple of zero values. Although I’m not an expert in credit scores, I do know zero values aren’t actually attainable, so I removed them from the second plot. You can see somewhat tighter groupings for the students, which makes sense, as a student theoretically would have had less time to build really good credit or to really ruin it. Other occupations range pretty far. Transparency is applied to this plot so that the darker the dot, the more values there are. You can see how our executives have a higher concentration of high credit scores.
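For anyone wanting to reproduce this kind of plot, here is a minimal sketch of dropping the zero scores and using transparency to show density. It assumes the ggplot2 package is installed, uses synthetic data, and the column names `Occupation` and `CreditScoreRangeLower` as well as the alpha value are assumptions, not the exact settings used for the plot above.

```r
library(ggplot2)

# Synthetic stand-in; column names are assumed and values are made up.
set.seed(1)
loans_sample <- data.frame(
  Occupation = factor(sample(c("Executive", "Laborer", "Chemist"), 300, replace = TRUE)),
  CreditScoreRangeLower = c(0, 0, sample(seq(420, 880, by = 20), 298, replace = TRUE))
)

# Treat a score of 0 as a placeholder rather than a real score and drop it.
# subset() also drops rows where the score is missing.
scored <- subset(loans_sample, CreditScoreRangeLower > 0)

# With a low alpha, overlapping points stack up and look darker.
p <- ggplot(scored, aes(x = Occupation, y = CreditScoreRangeLower)) +
  geom_point(alpha = 0.1) +
  labs(x = "Occupation", y = "Credit score (lower range)")
```

Printing `p` draws the plot. A lower alpha makes dense regions stand out more but can hide isolated points.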

## [1] "Summary of Executives"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   440.0   660.0   700.0   702.9   740.0   880.0
## [1] "Summary of Laborers"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   420.0   640.0   680.0   672.7   700.0   860.0
## [1] "Summary of Sales - Commission"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     440     640     680     682     720     880
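Summaries like the ones above can be produced for every occupation in one pass instead of one call per group. This is a sketch on made-up numbers, with the column names assumed, so the values it produces are not the ones reported above.

```r
# Synthetic stand-in; values are made up for illustration only.
scores <- data.frame(
  Occupation = c("Executive", "Executive", "Executive",
                 "Laborer", "Laborer", "Laborer"),
  CreditScoreRangeLower = c(660, 700, 740, 640, 680, 700)
)

# Split the scores by occupation, then summarise each group.
groups <- split(scores$CreditScoreRangeLower, scores$Occupation)
by_occupation <- lapply(groups, summary)

# Group means, and the gap between two groups of interest.
means <- sapply(groups, mean)
mean_gap <- means[["Executive"]] - means[["Laborer"]]

print(by_occupation)
print(mean_gap)
```

Computing the gap from the group means directly avoids relying on how `summary()` rounds its output.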

I would have expected there to be a bit more variation by occupation here. I would’ve thought that executives would be shifted further to the right, i.e. more of them would have higher credit scores. As you can see, they have a similar distribution to Sales - Commission and Laborers; their mean credit score is, however, about 30 points higher than the Laborers’ and about 20 points higher than Sales - Commission’s. There aren’t really a lot of counts for college students, which makes sense, since most people don’t apply for this type of loan in college.

Looks about like I expected this time: some people in sales are doing well for themselves, but the average is closer to where the Laborers are. Executives certainly have a broad range, but have more high values than any other column. Still not many students, and they don’t make much - especially Freshmen. The only thing that really surprises me here is Nurses. Personally I would’ve expected them to be more to the right.

For my last trick, I’d like to look at credit score vs income. I imagine that in general these should be pretty positively correlated. It stands to reason that people who make more money will have better credit. Let’s see if that’s actually true.

There are our outliers again, let’s get rid of our Fancy Pants rich people and clean this plot up a bit.
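One common way to trim extreme incomes is a percentile cutoff. The sketch below uses the 99th percentile on synthetic data. The cutoff is a hypothetical choice for illustration, since the threshold actually used for the plot is not stated here, and the object names are assumptions as well.

```r
# Synthetic stand-in with a handful of extreme incomes; values are made up.
set.seed(7)
loans <- data.frame(
  StatedMonthlyIncome = c(round(runif(995, 1000, 12000)),
                          150000, 250000, 400000, 900000, 1750000)
)

# Hypothetical cutoff: the 99th percentile of stated monthly income.
income_cap <- quantile(loans$StatedMonthlyIncome, probs = 0.99, na.rm = TRUE)

# Keep rows at or below the cutoff; rows with missing income are dropped too.
trimmed <- subset(loans, StatedMonthlyIncome <= income_cap)

print(summary(trimmed$StatedMonthlyIncome))
```

A percentile cutoff adapts to the data, while a fixed dollar limit is easier to explain. Either way it is worth reporting how many rows were removed.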

## 
##  Pearson's product-moment correlation
## 
## data:  df_2$StatedMonthlyIncome and df_2$CreditScoreRangeLower
## t = 36.54, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1021433 0.1136511
## sample estimates:
##       cor 
## 0.1079008

So it looks like income is not as good a predictor of credit as I thought it might be. There are noticeably fewer observations with a low credit score and high income (bottom right area), and there’s a noticeable dip in values with a high credit score and low income (top left area); however, overall there are plenty of high-income people with lower credit scores and vice versa. Pearson’s r of about 0.11 is statistically significant, which is unsurprising with over 113,000 observations, but it confirms that the correlation is weak.
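The difference between a tiny p-value and a weak relationship can be checked directly from the numbers in the test output. This short sketch only reuses the reported `cor` and `df` values; the variable names are my own.

```r
# Values copied from the cor.test output above.
r  <- 0.1079008
df <- 113340

# Share of variance in one variable linearly accounted for by the other.
r_squared <- r^2

# t statistic implied by r and the degrees of freedom.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

print(round(r_squared, 4))  # 0.0116, i.e. a little over 1% of the variance
print(round(t_stat, 2))     # 36.54, matching the reported t
```

So the large t value comes mostly from the sample size, while income linearly accounts for only about 1% of the variation in credit score.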

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

So my analysis was mostly finding out that things I thought might be related weren’t. Occupation doesn’t predict credit score very well, and neither does stated monthly income. The relationship between occupation and income was the only one that behaved as I expected, and even there there’s a lot of variance.

What was the strongest relationship you found?

I didn’t really find any strong relationships. To me the most interesting thing I found was the dips in the corners of the plot of credit score vs income.

Multivariate Plots Section

Let’s make a plot similar to the credit score vs stated monthly income, but this time we’ll color by occupation from our list of ten.

Alright, I’m pretty happy with this plot. We can see where the different occupations stack up as measured by credit score and monthly income. This plot suffers from overplotting, however, so I’m going to facet it to clean it up a bit more.
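As a rough sketch of the faceting idea, the code below colors a scatter plot by occupation and then splits it into one panel per occupation. It assumes ggplot2 is installed and runs on synthetic data. The column names and the alpha value are assumptions.

```r
library(ggplot2)

# Synthetic stand-in; column names are assumed and values are made up.
set.seed(3)
n <- 300
loans_sample <- data.frame(
  Occupation = factor(sample(c("Executive", "Laborer", "Chemist"), n, replace = TRUE)),
  StatedMonthlyIncome = round(runif(n, 1000, 12000)),
  CreditScoreRangeLower = sample(seq(420, 880, by = 20), n, replace = TRUE)
)

# Single panel: every occupation drawn on the same axes, colored by group.
p <- ggplot(loans_sample,
            aes(x = StatedMonthlyIncome, y = CreditScoreRangeLower,
                color = Occupation)) +
  geom_point(alpha = 0.5)

# One panel per occupation, so groups no longer cover each other.
p_faceted <- p + facet_wrap(~ Occupation)
```

Faceting trades direct overlay for readability: each group is easier to see, but comparing groups means reading across panels that share the same axes.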

You can see how students (pink and purple circles) have a fair range of credit scores: some are great, some not so much. Most of them are hanging out toward the low end of the pay scale though. You can see how our executives are better off for the most part, even though there are plenty of them that make less than our sales commission people.

Next I’m going to make a similar plot but with loan term instead of occupation. Instinctively I want to say that long loan terms will accompany low income and credit scores, but I’ve been wrong before.

This one doesn’t really look like what I thought it would. If anything, it’s darker at the bottom left and gets lighter moving away from it, i.e. the terms get longer, which is the opposite of what I guessed earlier. I’m 0/3 now, but yay for data correcting erroneous notions.

I thought that perhaps it’d be interesting to see all four variables together, faceted by occupation and colored by loan term. My previous observations hold true, and this plot ties it all together.
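A four-variable plot of this kind might be put together as in the sketch below: income and credit score on the axes, loan term as color, and one panel per occupation. It assumes ggplot2 is installed, uses synthetic data, and the column names and the term values of 12, 36, and 60 months are assumptions of the example.

```r
library(ggplot2)

# Synthetic stand-in; column names and term values are assumed.
set.seed(5)
n <- 300
loans_sample <- data.frame(
  Occupation = factor(sample(c("Executive", "Laborer", "Chemist"), n, replace = TRUE)),
  StatedMonthlyIncome = round(runif(n, 1000, 12000)),
  CreditScoreRangeLower = sample(seq(420, 880, by = 20), n, replace = TRUE),
  Term = sample(c(12, 36, 60), n, replace = TRUE)
)

# Numeric Term gives a continuous gradient: darker is shorter, lighter is longer.
p_all <- ggplot(loans_sample,
                aes(x = StatedMonthlyIncome, y = CreditScoreRangeLower,
                    color = Term)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Occupation) +
  labs(color = "Term (months)")
```

Mapping `factor(Term)` instead would give each term its own distinct color, which can be easier to read when there are only a few term lengths.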

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Seeing occupation, credit score, and income all together was really interesting to me. Although this analysis didn’t end up being exactly about finding predictor variables, I did see how occupation and income relate to credit score.

Were there any interesting or surprising interactions between features?

The overall lack of any relationship between loan term and the combination of credit score and monthly income was surprising to me. Because shorter terms generally cost more up front, I expected longer terms to be concentrated among low-income individuals, but that didn’t appear to be the case.

Final Plots and Summary

Plot One

Description One

This plot from my first univariate analysis stood out to me. It was interesting to see how quickly the number of inquiries declined for most of the population, and yet how some people seem to make multiple inquiries a day.

Plot Two

Description Two

I thought this plot was noteworthy. It was interesting to see how incomes were dispersed, particularly the spikes at standard-seeming values. It was also interesting to compare the distributions of the different job types.

Plot Three

Description Three

This final plot pulls everything together and shows how Credit Score, Monthly Income, Loan Term, and Occupation all relate to one another. It paints a picture of these elements in an interesting and descriptive way.


Reflection

How Was This Analysis

This was a really fun dataset to study and learn about. Exploring income data and how it relates to loans was interesting, and there are a lot of subtle things happening that are worth investigating. My biggest struggle was probably finding which variables made the most sense to investigate. Having a dataset this large is good in some ways because you have a lot to work with, but it can certainly complicate things and make a concise analysis difficult. Overall I’m happy with what I was able to learn and the things I was able to find out about this dataset.

What Would I do Next Time?

There’s really so much more left to explore in this dataset that I could likely spend weeks working on it. Another interesting thing for a future analysis would be to add state data: seeing which states show up the most, which jobs are in which states, which states have more debt, etc. There are plenty of unanswered questions for next time.